fix: add per-domain RequestThrottler for 429 backoff #1762

Open
MrAliHasan wants to merge 3 commits into apify:master from MrAliHasan:fix/request-throttler-429-backoff

Conversation

@MrAliHasan

Fixes #1437

Problem

When target websites return HTTP 429 (Too Many Requests), the AutoscaledPool scales UP instead of down — creating a "death spiral." This happens because:

  1. 429 responses trigger SessionError → session retires → request retried
  2. Less CPU work during retries → is_system_idle returns True
  3. _autoscale() sees idle CPU → increases concurrency
  4. More concurrent requests → more 429s → repeat

The existing _snapshot_client only tracks Apify storage API rate limits, not target website 429s.

Solution

Following @Pijukatel's suggestion, I created a dedicated RequestThrottler component that handles 429 backoff per domain — the AutoscaledPool is completely untouched.

Key features:

  • Per-domain tracking — rate limiting on example.com doesn't affect other-site.com
  • Exponential backoff — 2s → 4s → 8s → ... capped at 60s
  • Retry-After header support — parses both integer seconds and HTTP-date formats
  • Throttled requests are reclaimed — they go back to the queue, not dropped
  • Backoff resets on success — consecutive 429 count resets when a request succeeds

How it works

  1. BasicCrawler.__run_task_function checks RequestThrottler.is_throttled(url) before processing
  2. If the domain is throttled, the request is reclaimed (returned to queue for later)
  3. When a 429 is detected in _raise_for_session_blocked_status_code, the domain is recorded
  4. On successful request (RequestState.DONE), the backoff counter resets
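A minimal sketch of the backoff logic these steps imply (the class name matches the PR, but the method signatures such as `record_429` and the internal state layout are assumptions, not the PR's actual code):

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse


class RequestThrottler:
    """Tracks per-domain 429 backoff with exponential delay (illustrative sketch)."""

    def __init__(self, base_delay: float = 2.0, max_delay: float = 60.0) -> None:
        self._base_delay = base_delay
        self._max_delay = max_delay
        # domain -> (consecutive 429 count, throttled-until timestamp)
        self._state: dict[str, tuple[int, datetime]] = {}

    @staticmethod
    def _domain(url: str) -> str:
        return urlparse(url).netloc

    def record_429(self, url: str, retry_after: timedelta | None = None) -> None:
        domain = self._domain(url)
        count = self._state.get(domain, (0, datetime.min.replace(tzinfo=timezone.utc)))[0] + 1
        if retry_after is not None:
            delay = retry_after.total_seconds()  # Retry-After takes priority
        else:
            # 2s -> 4s -> 8s -> ... capped at max_delay
            delay = min(self._base_delay * 2 ** (count - 1), self._max_delay)
        self._state[domain] = (count, datetime.now(timezone.utc) + timedelta(seconds=delay))

    def record_success(self, url: str) -> None:
        # Backoff resets on success: drop the domain's state entirely.
        self._state.pop(self._domain(url), None)

    def is_throttled(self, url: str) -> bool:
        entry = self._state.get(self._domain(url))
        return entry is not None and datetime.now(timezone.utc) < entry[1]

    def get_delay(self, url: str) -> timedelta:
        entry = self._state.get(self._domain(url))
        if entry is None:
            return timedelta(0)
        return max(entry[1] - datetime.now(timezone.utc), timedelta(0))
```

Because state is keyed by `netloc`, a 429 on one domain never delays requests to another.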

Files changed

  • src/crawlee/_request_throttler.py — NEW: per-domain 429 tracker
  • src/crawlee/crawlers/_basic/_basic_crawler.py — throttle check, 429 recording, success reset, Retry-After parsing
  • src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py — pass URL + Retry-After header
  • src/crawlee/crawlers/_playwright/_playwright_crawler.py — pass URL + Retry-After header
  • tests/unit/test_request_throttler.py — NEW: 13 unit tests

Tests

  • 13 new tests covering: domain independence, exponential backoff, max delay cap, Retry-After priority, success reset, expiry, edge cases
  • 8 existing autoscaling tests pass with zero regressions

Future work

This is a focused first step toward the full RequestAnalyzer that @Pijukatel outlined (with robots.txt integration, URL group management, etc.).

Add a new RequestThrottler component that handles HTTP 429 (Too Many
Requests) responses on a per-domain basis, preventing the autoscaling
death spiral where 429s cause concurrency to increase.

Key features:
- Per-domain tracking: rate limiting on domain A doesn't affect domain B
- Exponential backoff: 2s -> 4s -> 8s -> ... capped at 60s
- Retry-After header support (both seconds and HTTP-date formats)
- Throttled requests are reclaimed to the queue, not dropped
- Backoff resets on successful requests to that domain

The AutoscaledPool is completely untouched - throttling happens
transparently in BasicCrawler.__run_task_function before processing.

Integration points:
- BasicCrawler: throttle check, 429 recording, success reset
- AbstractHttpCrawler: passes URL + Retry-After to detection
- PlaywrightCrawler: passes URL + Retry-After to detection

Closes apify#1437
@vdusek
Collaborator

vdusek commented Feb 23, 2026

Hi @MrAliHasan, thanks for your contribution! We'll try to review this soon.

Collaborator

@janbuchar janbuchar left a comment

As mentioned in #1762 (comment), the approach of reclaiming throttled requests is not optimal.

On top of that, the solution to #1437 should probably be extensible enough to also cover #1396 without much tweaking.

I believe such a solution could be implemented in crawlee-python quite easily. See the similar issue for crawlee-js. The Python version already supports multiple "unnamed queues" via RequestQueue.open(alias="..."), so you'd only need to implement a ThrottlingRequestManager (an implementation of the RequestManager interface) that keeps track of the per-domain queues and their delays.

Do you want to try it?

Comment on lines 1591 to 1592
@staticmethod
def _parse_retry_after_header(value: str | None) -> timedelta | None:
Collaborator

This has no business being in BasicCrawler. Better put it in the _utils module.

Author

Moved to crawlee._utils.http.parse_retry_after_header in the refactor commit.
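For reference, a helper of this shape might look roughly like the following (an illustrative reimplementation using only the standard library, not the PR's actual code):

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after_header(value: str | None) -> timedelta | None:
    """Parse a Retry-After header: either delta-seconds or an HTTP-date."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():  # e.g. 'Retry-After: 120'
        return timedelta(seconds=int(value))
    try:  # e.g. 'Retry-After: Wed, 21 Oct 2026 07:28:00 GMT'
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):  # unparseable header value
        return None
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    # A date in the past means no further waiting is needed.
    return max(target - datetime.now(timezone.utc), timedelta(0))
```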

Comment on lines 1401 to 1408
# Check if this domain is currently rate-limited (429 backoff).
if self._request_throttler.is_throttled(request.url):
    self._logger.debug(
        f'Request to {request.url} delayed - domain is rate-limited '
        f'(retry in {self._request_throttler.get_delay(request.url).total_seconds():.1f}s)'
    )
    await request_manager.reclaim_request(request)
    return
Collaborator

If, at some point, the request queue contains only requests from a throttled domain, this will become a busy wait with extra steps. If you're using the Apify platform, this will cost a lot in request queue writes.

I'm afraid that this means we cannot accept the PR in the current state. See the main review comments for possible next steps.

Author

Fully addressed in the refactor. The reclaim-based throttle block was removed. ThrottlingRequestManager.fetch_next_request() now handles scheduling and awaits asyncio.sleep() when all domains are throttled, eliminating busy-wait and extra queue writes.

@MrAliHasan
Author

Thanks for the detailed review. That makes sense regarding the busy-wait behavior and queue writes.
I’ll refactor this into a ThrottlingRequestManager implementation so that the throttling logic lives in the request scheduling layer rather than in BasicCrawler.
I’ll push an updated version soon. Appreciate the guidance.

Move per-domain throttling from execution layer (BasicCrawler.__run_task_function)
to scheduling layer (ThrottlingRequestManager.fetch_next_request).

- ThrottlingRequestManager wraps RequestQueue, implements RequestManager interface
- fetch_next_request() buffers throttled requests and asyncio.sleep()s when all
  domains are throttled — eliminates busy-wait and unnecessary queue writes
- Unified delay mechanism supports both HTTP 429 backoff and robots.txt
  crawl-delay (apify#1396)
- parse_retry_after_header moved to crawlee._utils.http
- 23 new tests covering throttling, scheduling, delegation, and crawl-delay

Addresses apify#1437, apify#1396
@janbuchar janbuchar self-requested a review February 24, 2026 10:22
Comment on lines 1455 to 1456
if self._throttling_manager:
    self._throttling_manager.record_success(request.url)
Collaborator

I'd rather use isinstance so that an explicitly configured ThrottlingRequestManager also works.

Author

Fixed! Updated to check isinstance(manager, ThrottlingRequestManager).

Args:
    session: The session used for the request. If None, no check is performed.
    status_code: The HTTP status code to check.
    url: The request URL, used for per-domain rate limit tracking.
Collaborator

Perhaps the parameter should be renamed to request_url so that it's not ambiguous with a URL after redirects.

Author

Done. Renamed to request_url across BasicCrawler, AbstractHttpCrawler, and PlaywrightCrawler.

"""Raise an exception if the given status code indicates the session is blocked.

If the status code is 429 (Too Many Requests), the domain is recorded as
rate-limited in the `RequestThrottler` for per-domain backoff.
Collaborator

Outdated name

Comment on lines 1591 to 1592
# NOTE: _parse_retry_after_header has been moved to crawlee._utils.http.parse_retry_after_header

Collaborator

Suggested change
# NOTE: _parse_retry_after_header has been moved to crawlee._utils.http.parse_retry_after_header

if not robots_txt_file:
    return True

# Wire robots.txt crawl-delay into ThrottlingRequestManager (#1396).
Collaborator

Suggested change
# Wire robots.txt crawl-delay into ThrottlingRequestManager (#1396).
# Wire robots.txt crawl-delay into ThrottlingRequestManager

    configuration=self._service_locator.get_configuration(),
)
self._throttling_manager = ThrottlingRequestManager(inner)
self._request_manager = self._throttling_manager
Collaborator

I'm not sure if we should use ThrottlingRequestManager by default - thoughts, @vdusek @Pijukatel, @Mantisus?

Collaborator

I think it makes sense to enable it by default. When the crawler runs without a proxy, 429 will only increase the load on the site. This will not benefit either the site or the crawler.

Using a proxy requires a little configuration, so I think that an additional parameter to disable throttling for 429 will not complicate this.

Collaborator

Please make the tests conform to the existing test structure. Perhaps @vdusek could provide further guidance here?

Author

Good call! I've removed the Test... classes and rewritten test_throttling_request_manager.py to use standard top-level async def test_...() functions with pytest fixtures, matching test_request_list.py.

Comment on lines +1 to +8
"""A request manager wrapper that enforces per-domain delays.

Handles both HTTP 429 backoff and robots.txt crawl-delay at the scheduling layer,
eliminating the busy-wait problem described in https://github.com/apify/crawlee-python/issues/1437.

Also addresses https://github.com/apify/crawlee-python/issues/1396 by providing a unified
delay mechanism for crawl-delay directives.
"""
Collaborator

No need to mention github issues here if they will be resolved by this PR.

"""
self._inner = inner
self._domain_states: dict[str, _DomainState] = {}
self._buffered_requests: list[Request] = []
Collaborator

This is a fundamental deviation from what I described - instead of storing the requests in memory, we should create a special request queue in memory (see #1762 (review)) for each domain. The ThrottlingRequestManager will then delegate fetch_next_request calls to sub-queues that are not currently delayed.

Author

You're absolutely right. This has been fully refactored. ThrottlingRequestManager now dynamically instantiates RequestQueue.open(alias=f"throttled-{domain}") for throttled domains. Requests are routed to these separate sub-queues to avoid memory pressure, and fetch_next_request pulls from them seamlessly when domain delays expire. State operations (like is_empty and get_handled_count) now aggregate cleanly across the inner queue and all sub-queues. This aligns with Crawlee's distributed design and ensures request persistence.
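A rough, self-contained sketch of that delegation scheme (the real code would use `await RequestQueue.open(alias=f'throttled-{domain}')` for the sub-queues; `SimpleQueue`, the `is_domain_throttled` throttler interface, and the routing details here are illustrative assumptions):

```python
import asyncio
from urllib.parse import urlparse


class SimpleQueue:
    """In-memory stand-in for a crawlee RequestQueue, to keep the sketch runnable."""

    def __init__(self) -> None:
        self._items: list = []

    async def add_request(self, request) -> None:
        self._items.append(request)

    async def fetch_next_request(self):
        return self._items.pop(0) if self._items else None

    async def is_empty(self) -> bool:
        return not self._items


class ThrottlingRequestManager:
    """Routes requests for throttled domains into per-domain sub-queues."""

    def __init__(self, inner: SimpleQueue, throttler) -> None:
        self._inner = inner
        self._throttler = throttler
        self._sub_queues: dict[str, SimpleQueue] = {}

    async def _sub_queue(self, domain: str) -> SimpleQueue:
        # Real implementation: await RequestQueue.open(alias=f'throttled-{domain}'),
        # so parked requests are persisted rather than held in memory.
        if domain not in self._sub_queues:
            self._sub_queues[domain] = SimpleQueue()
        return self._sub_queues[domain]

    async def fetch_next_request(self):
        # Sub-queues whose delay has expired take priority over the inner queue.
        for domain, queue in list(self._sub_queues.items()):
            if not self._throttler.is_domain_throttled(domain):
                request = await queue.fetch_next_request()
                if request is not None:
                    return request

        request = await self._inner.fetch_next_request()
        if request is not None:
            domain = urlparse(request.url).netloc
            if self._throttler.is_domain_throttled(domain):
                # Park the request in its domain's sub-queue and try again.
                await (await self._sub_queue(domain)).add_request(request)
                return await self.fetch_next_request()
        return request

    async def is_empty(self) -> bool:
        # Aggregate state across the inner queue and all sub-queues.
        if not await self._inner.is_empty():
            return False
        return all([await queue.is_empty() for queue in self._sub_queues.values()])
```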

…queues and update its integration across crawlers.
@MrAliHasan
Author

Heads up @janbuchar @vdusek @Mantisus: I've pushed a significant refactor based on the latest feedback.

Sub-queues over memory buffer: ThrottlingRequestManager now delegates to persistent per-domain sub-queues via RequestQueue.open(alias=f"throttled-{domain}") instead of keeping throttled requests in memory.

Test Structure: Completely rewrote test_throttling_request_manager.py to drop the Test... classes and conform to Crawlee's standard test structure.

BasicCrawler fixes: Addressed all inline nits (used isinstance(), renamed url to request_url in _raise_for_session_blocked_status_code, updated docstrings/comments).

The tests track the routing origin and safely aggregate get_handled_count and is_empty metrics across the main queue and sub-queues. All 24 tests pass, and Ruff and Pytest issues have been resolved. Let me know if the updated delegation architecture feels right!

Successfully merging this pull request may close these issues:

Fix autoscaled pool scaling behavior on 429 Too Many Requests